Abstract

We study the capability of a fully connected asymmetric network to learn and to generate
long-range, power-law correlated sequences. The focus is on the ability of neural networks
to extract statistical features from a sequence. We demonstrate that the average power-law
behavior is learnable; namely, the sequence generated by the trained network obeys the same
statistical behavior. The interplay between a correlated weight matrix and the sequence
generated by such a network is explored. A weight matrix with a power-law correlation
function along the vertical direction gives rise to a sequence with similar statistical
behavior.

PACS numbers: 84.35.+i, 05.10.-a

I. INTRODUCTION 

Real-life (temporal) sequences are characterized by a certain degree of correlation. It is
known that a wide range of systems in nature display long-range correlations, e.g., biological
systems (DNA sequences and heartbeat intervals), natural languages, etc. Since long-range
correlations can appear in many forms, we restrict the analysis to the case of power-law
correlations in a random sequence; e.g., the correlation function for a 1D sequence $x_i$ is given
by

$$C(l) = \langle x_i x_{i+l} \rangle \propto l^{-\gamma} \qquad (l \to \infty,\ \gamma > 0) , \tag{1}$$

where the angular brackets denote an average over the randomness. This type of random 
sequence is also termed "colored" or correlated-noise. The general form of description allows 
us to investigate the capability of the network to capture statistical properties of a sequence. 

The theory of learning from examples by a neural network, and in particular on-line 
learning, has been developed almost exclusively for uncorrelated patterns, see [3,4]. Though
some particular cases of correlated patterns were treated, they were limited to simple spatial
correlations within each pattern, or to temporal correlations of each input unit, e.g., [5]. The
case of long-range correlations is absent. Clearly, the problem of extracting a feature from a 
correlated sequence whose length is much larger than the network's size, cannot be treated 
under the same assumptions. Rather than dealing with the question of the generalization 
error (average over a distribution of patterns), we focus on the capability of the network 
to asymptotically capture the correlations within the sequence and its ability to generate a 
sequence with similar properties. 

As shown previously, the sequence generator (SGen), a continuous-valued feed-forward
network in which the next state vector is determined from past output values, exhibits
(quasi-)periodic attractors in the stable regime, regardless of the complexity of the weights,
both for a perceptron and for multilayer SGens, e.g., [6]-[8]; the unstable, chaotic
regime is studied in [9]. Therefore, neither the perceptron-SGen nor its extension
to multilayer networks is a suitable candidate for learning and generating correlated
noise. The natural way to overcome this limitation is to increase the complexity of the
feedback in the architecture. In this paper we study a fully connected, asymmetric network.
The updating rule for the network's state is either sequential or parallel; namely, each unit
is updated in its turn or all units are updated simultaneously. Unit $i$ ($i = 1, \dots, N$) is
updated as follows:

$$S_i^{t+1} = \tanh\Big[ \beta \Big( \sum_{j<i} W_{ij} S_j^{t+1} + \sum_{j \ge i} W_{ij} S_j^{t} \Big) \Big] , \tag{2}$$

$$S_i^{t+1} = \tanh\Big( \beta \sum_{j=1}^{N} W_{ij} S_j^{t} \Big) , \tag{3}$$

where eq. (2) [eq. (3)] refers to the sequential (parallel) rule; $W$ is an $N \times N$ weight matrix and $\beta$
is a gain parameter. The network iteratively generates an infinite sequence $\{\sigma_m\}$, starting
from an initial state $S^0$, as follows:

$$\sigma_m = S_i^{t+1} , \qquad m = tN + i = 1, 2, \dots , \tag{4}$$

where $i = 1, \dots, N$ and $t = 0, 1, 2, \dots$.
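As an illustration, the two updating rules and the construction of the generated sequence can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: it assumes that in the sequential rule each unit sees the already-updated values of the units preceding it, and the function name `generate_sequence`, the small size `N = 8`, the Gaussian weights and the value `beta = 3.0` are illustrative choices only.

```python
import math
import random

def generate_sequence(W, S0, beta, cycles, sequential=True):
    """Iterate the network and return sigma_m = S_i^{t+1}, with m = t*N + i."""
    N = len(S0)
    S = list(S0)          # work on a copy so the initial state is left untouched
    sigma = []
    for _ in range(cycles):
        if sequential:
            # Assumption: unit i uses the new values of units 0..i-1.
            for i in range(N):
                field = sum(W[i][j] * S[j] for j in range(N))
                S[i] = math.tanh(beta * field)
        else:
            # Parallel rule: every unit is computed from the previous state.
            S = [math.tanh(beta * sum(W[i][j] * S[j] for j in range(N)))
                 for i in range(N)]
        sigma.extend(S)   # one cycle contributes N consecutive sequence values
    return sigma

rng = random.Random(0)
N = 8
W = [[rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)] for _ in range(N)]
S0 = [rng.uniform(-1.0, 1.0) for _ in range(N)]
sigma = generate_sequence(W, S0, beta=3.0, cycles=5)
```

Each full cycle appends $N$ values, so `cycles` plays the role of the time index $t$ and the position inside a cycle plays the role of the unit index $i$.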

Two complementary issues are discussed in this paper: (a) Given a training sequence
characterized by long-range correlations, can we train a network in an on-line scheme to generate
a sequence with the same asymptotic statistical properties? (b) The inverse problem:
is there an interplay between a network whose weight matrix follows a power-law correlation
function and the sequence it generates?

It is important to stress that the model we investigate is not proposed as a practical
method for generating long-range correlated sequences; rather, it is motivated by the issues
raised above.

In the next section we investigate the first question. An "on-line", gradient-based learning
rule is applied, where each example is presented to the network only once. The sequences
that constitute the basis for the training patterns and the correlated weight matrices are
generated using an algorithm for re-shaping the power spectrum of an uncorrelated sequence;
the method has been developed for investigating various stochastic processes, see [10]. In
Section III the inverse problem is analyzed. A method for constructing the weight matrix is
presented, based on the findings of Section II regarding the correlation properties
of the weights in trained networks. A simple analytical derivation supports these findings.
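The idea of re-shaping the power spectrum of an uncorrelated sequence can be illustrated with plain Fourier filtering. The sketch below is not necessarily the cited algorithm: it assumes $0 < \gamma < 1$ and uses the standard correspondence between a correlation function decaying as $l^{-\gamma}$ and a power spectrum behaving as $f^{-(1-\gamma)}$ at low frequencies. The helper names `dft` and `colored_sequence` are hypothetical, and a naive $O(n^2)$ transform is used only to keep the example dependency-free.

```python
import cmath
import math
import random

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short sequences)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def colored_sequence(n, gamma, rng):
    """Filter white Gaussian noise so that C(l) decays roughly as l**(-gamma)."""
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    spectrum = dft(white)
    shaped = [0j] * n                      # k = 0 stays zero, which removes the mean
    for k in range(1, n):
        f = min(k, n - k) / n              # symmetric frequency keeps the output real
        shaped[k] = spectrum[k] * f ** (-(1.0 - gamma) / 2.0)
    x = [z.real for z in dft(shaped, inverse=True)]
    norm = math.sqrt(sum(v * v for v in x) / n)
    return [v / norm for v in x]           # unit variance, zero mean

x = colored_sequence(256, gamma=0.4, rng=random.Random(1))
```

For sequences as long as those used in the text, a fast Fourier transform would replace the naive `dft`; the filtering step itself is unchanged.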
II. GENERALIZING THE RULE OF A COLORED SEQUENCE



Suppose a source generates sequences that obey eq. (1); the question we address in this
section is whether the statistical properties of the source can be learned, and in
particular the exponent $\gamma$ of the power-law correlation function. For a network defined by
its weights $W$, a gain $\beta$ and a nonlinear function $f$, the response of the $i$th unit, given the
current state $\xi^t$, is

$$S_i^{t+1} = f(\xi^t, W_i) ,$$

where $W_i$ denotes the vector of weights connected to the $i$th unit. The on-line learning
algorithm minimizes a quadratic error function

$$\epsilon_i(\xi^t, W_i) = \big[ S_i^{t+1} - \tau_i^{t+1} \big]^2 / 2 , \tag{5}$$

where $\tau_i^{t+1}$ is the desired response of the $i$th unit given the state $\xi^t$. The weights are updated
according to

$$W_i^{t+1} = W_i^t - \eta \, \nabla_{W_i} \epsilon_i(\xi^t, W_i^t) , \tag{6}$$

i.e., a gradient descent rule with a learning rate $\eta$ (similar results were obtained using the
Hebbian learning rule).
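A single on-line training cycle can be sketched as follows. This is a minimal sketch under the assumption that the nonlinear function is $f(\xi, W_i) = \tanh(\beta\, W_i \cdot \xi)$, as in the dynamics of the network; the names `online_step` and `total_error` are hypothetical, and any scaling of the learning rate with $N$ is left out because the text does not specify it.

```python
import math

def total_error(W, xi, tau, beta):
    """Sum over units of the quadratic error [S_i - tau_i]^2 / 2."""
    err = 0.0
    for i in range(len(W)):
        out = math.tanh(beta * sum(w * x for w, x in zip(W[i], xi)))
        err += 0.5 * (out - tau[i]) ** 2
    return err

def online_step(W, xi, tau, beta, eta):
    """One training cycle: move every weight vector W[i] down its error gradient."""
    for i in range(len(W)):
        out = math.tanh(beta * sum(w * x for w, x in zip(W[i], xi)))
        # d(eps_i)/dW_ij = (out - tau_i) * beta * (1 - out^2) * xi_j
        delta = (out - tau[i]) * beta * (1.0 - out * out)
        for j in range(len(xi)):
            W[i][j] -= eta * delta * xi[j]
    return W

W = [[0.1, -0.2], [0.0, 0.3]]
xi, tau = [0.5, -0.4], [0.3, -0.2]
before = total_error(W, xi, tau, beta=1.0)
online_step(W, xi, tau, beta=1.0, eta=0.1)
after = total_error(W, xi, tau, beta=1.0)
```

In the on-line scheme each pattern is used for one such step only, so `online_step` would be called once per pattern while the window moves along the training sequence.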

The training patterns are defined as follows. Let $D_L = \{x_1, x_2, \dots, x_L\}$ be a 1D sequence
obeying eq. (1). A training pattern, i.e., the pair $(\xi^m, \tau_i^m)$ with $1 \le m \le L - 2N + 1$ and $i = 1, \dots, N$,
is defined by

$$\xi^m = (x_m, x_{m+1}, \dots, x_{m+N-1}) , \qquad \tau_i^m = x_{m+N+i-1} , \tag{7}$$

where each weight vector $W_i$ is updated with the corresponding desired output $\tau_i^m$ and the
same vector $\xi^m$. Updating all $N$ weight vectors for a given pattern $(\xi^m, \tau_i^m;\ i = 1, \dots, N)$
accounts for a single training cycle. The patterns for consecutive training cycles are obtained
by sliding a window of size $N$ by one site along the sequence $D_L$; e.g., given a current
pattern $\xi^m$ (eq. (7)) starting at site $m$ along the training sequence, the next pattern is
$\xi^{m+1} = (x_{m+1}, x_{m+2}, \dots, x_{m+N})$. The training patterns can also be obtained in a different scheme,
by sliding the window $N$ sites at a time (non-overlapping windows), i.e.,

$$\xi^m = (x_{mN+1}, x_{mN+2}, \dots, x_{mN+N}) , \qquad \tau_i^m = x_{(m+1)N+i} , \qquad m = 0, 1, \dots \tag{8}$$

Results obtained in both schemes are similar; however, the length of the sequence $D_L$ used
in the second scheme (to obtain the same results) is about $N$ times larger.
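Both windowing schemes can be expressed with a single stride parameter. The following is a minimal sketch; the generator name `training_patterns` and the `stride` argument are hypothetical, and the indices are 0-based whereas the text counts sites from 1.

```python
def training_patterns(x, N, stride=1):
    """Yield (xi, tau) pairs cut from the sequence x with a window of size N.

    stride=1 slides the window one site per cycle (overlapping windows);
    stride=N moves it N sites per cycle (non-overlapping windows).
    """
    start = 0
    while start + 2 * N <= len(x):
        xi = x[start:start + N]              # input pattern: N consecutive sites
        tau = x[start + N:start + 2 * N]     # desired outputs: the next N sites
        yield xi, tau
        start += stride

x = list(range(1, 11))                       # toy sequence x_1, ..., x_10
overlapping = list(training_patterns(x, N=3))
disjoint = list(training_patterns(x, N=3, stride=3))
```

For the toy sequence with $L = 10$ and $N = 3$, the overlapping scheme yields $L - 2N + 1 = 5$ patterns and the non-overlapping scheme yields 2, which illustrates why the second scheme needs a much longer sequence.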

Let us now describe our numerical investigation. An ensemble of long sequences $D_L$ (of
size $L \gg N$) that obey eq. (1) with a given $\gamma$ is generated. A randomly chosen network is
trained using a part of the sequence. Taking the last pattern from the training process as
an initial state, the trained network is used to generate iteratively a long sequence $\{\sigma_m\}$
of size $MN$ with $M = 10$, following the sequential rule, eq. (2), and eq. (4). The correlation
function of this sequence can be calculated in two ways: spatial and temporal.
The spatial correlation is obtained by averaging the correlation function, calculated on the $M$
sequences of size $N$ obtained after updating all $N$ units, over the $M$ iteration cycles, whereas the
temporal correlation is simply the correlation function of the long sequence, i.e.,

$$C_{\rm spat}(l) = \frac{1}{M} \sum_{j=0}^{M-1} \frac{1}{N} \sum_{i=1}^{N} \sigma_{jN+i}\, \sigma_{jN+[(i+l) \bmod N]} , \qquad C_{\rm temp}(l) = (MN)^{-1} \sum_{i=1}^{MN} \sigma_i\, \sigma_{i+l} , \tag{9}$$

where we take periodic boundary conditions. Next, the same weight matrix is trained further,
and after each $\alpha N$ patterns ($\alpha = 10$) the same process of "generating a sequence and
calculating its correlation function" is repeated for a better statistical estimate. We
found that both definitions of the correlation function yield similar results; therefore, in the
sequel we omit the subscript from eq. (9) and refer to the temporal correlation function

$$C(l) = C_{\rm temp}(l) . \tag{10}$$

Note that the range of correlations is bounded by the number of degrees of freedom, $\{S_i\}_{i=1}^{N}$;
therefore, the correlation function is calculated in the range $0 \le l \le N/2$ (a
symmetric function). The whole procedure is applied to all members of the ensemble. This
extensive averaging is necessary since the patterns $\xi^m$ taken from the long sequences $D_L$
exhibit large fluctuations (recall that $N \ll L$ and the variance decreases linearly with the
size of the sequence).
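The two estimators of eq. (9) can be sketched directly from their definitions. This is a minimal sketch with periodic boundary conditions and 0-based indices; the function names `temporal_correlation` and `spatial_correlation` are hypothetical, and the alternating toy sequence is used only because its correlations are known exactly.

```python
def temporal_correlation(sigma, l):
    """C_temp(l): average of sigma_i * sigma_{i+l} over the whole periodic sequence."""
    n = len(sigma)
    return sum(sigma[i] * sigma[(i + l) % n] for i in range(n)) / n

def spatial_correlation(sigma, N, l):
    """C_spat(l): correlation inside each block of N values, averaged over the blocks."""
    M = len(sigma) // N
    total = 0.0
    for j in range(M):
        block = sigma[j * N:(j + 1) * N]     # state of all N units after cycle j
        total += sum(block[i] * block[(i + l) % N] for i in range(N)) / N
    return total / M

sigma = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
c_temp = [temporal_correlation(sigma, l) for l in range(3)]
c_spat = [spatial_correlation(sigma, 4, l) for l in range(3)]
```

For the alternating sequence both estimators give $1, -1, 1$ at lags $0, 1, 2$, a simple consistency check of the two definitions.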

Figure 1 depicts the results of the above procedure for $\gamma = 0.4, 0.6, 0.8$ ($N = 200$),
with $L \approx 10^5$ and an ensemble of 50-100 samples. For comparison, we show the results of
training with patterns obtained by sliding the window $N$ sites each cycle (non-overlapping
windows, eq. (8)); in this case the sequence is much longer, $L \approx 10^7$. The data points are
the average values, with relative error bars that vary from 5% of the data point for small $l$ to 40% (less than
20% for non-overlapping windows) for $l \approx N/2$; we omit them to
preserve the clarity of the figure. The learning rate used, $\eta = 2$ (eq. (6)), is not optimal;
an optimization can reduce the fluctuations in the correlation function,
since the learning rate affects the relative change in the weights. We note that the fluctuations observed
in the sequences $\{\sigma_m\}$ generated by the trained networks are similar to those of the training
patterns $\xi$, indicating a finite-size effect. This has been confirmed for several network sizes.
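An exponent can be estimated from a measured correlation function by a straight-line fit in log-log coordinates. The text does not state which fitting procedure was used, so the ordinary least-squares fit below is an assumption; the function name `fit_power_law` is hypothetical and the synthetic data are an exact power law chosen only to check the routine.

```python
import math

def fit_power_law(lags, values):
    """Least-squares line through (log l, log C); returns the decay exponent gamma."""
    pts = [(math.log(l), math.log(c)) for l, c in zip(lags, values) if c > 0]
    n = len(pts)
    mx = sum(px for px, _ in pts) / n
    my = sum(py for _, py in pts) / n
    slope = (sum((px - mx) * (py - my) for px, py in pts)
             / sum((px - mx) ** 2 for px, _ in pts))
    return -slope                           # C(l) ~ l**(-gamma) has slope -gamma

lags = list(range(1, 101))
values = [l ** -0.6 for l in lags]          # synthetic data with gamma = 0.6
gamma_est = fit_power_law(lags, values)
```

Non-positive values of the correlation function are skipped because their logarithm is undefined; with noisy data this choice biases the fit at large lags, where fluctuations are strongest.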

It is interesting to examine how the network (learning algorithm) has embedded the
relevant information associated with the correlations. At the end of the training process
described above, we measured the correlation function of the weight vectors (averaged over
realizations of $D_L$) in two directions, horizontal (over rows) and vertical (over columns), as
follows:

$$C_h(l) = \langle W_{i,j}\, W_{i,j+l} \rangle , \qquad C_v(l) = \langle W_{i,j}\, W_{i+l,j} \rangle . \tag{11}$$

Results are presented in Fig. 2 for a training sequence obeying a power-law correlation
function with exponent $\gamma = 0.6$ and a network of size $N = 300$. Clearly, the vertical
correlations follow a rule similar to that of the sequence, $C_v(l) \sim l^{-0.625}$, while the horizontal
correlations decay much faster, with an exponential fit $C_h(l) \sim \exp(-0.03\, l)$. The case of
training with patterns obtained from non-overlapping windows is presented for comparison
by the opaque circles. In this case the vertical correlations are similar, $C_v(l) \sim l^{-0.61}$;
however, there are no horizontal correlations. We conclude that sliding one site each cycle
induces horizontal correlations, although they decay exponentially fast.
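The horizontal and vertical correlation functions of a weight matrix can be measured as sketched below. This is a minimal sketch assuming periodic boundaries and a plain average over all matrix entries, since the normalization used in the measurements is not spelled out; the function names are hypothetical.

```python
def horizontal_correlation(W, l):
    """C_h(l): average of W[i][j] * W[i][j+l], i.e. a shift along each row."""
    n = len(W)
    return sum(W[i][j] * W[i][(j + l) % n]
               for i in range(n) for j in range(n)) / n ** 2

def vertical_correlation(W, l):
    """C_v(l): average of W[i][j] * W[i+l][j], i.e. a shift along each column."""
    n = len(W)
    return sum(W[i][j] * W[(i + l) % n][j]
               for i in range(n) for j in range(n)) / n ** 2

# Toy matrix with identical rows: perfectly correlated along columns,
# perfectly anti-correlated between neighbouring entries of a row.
W = [[1.0, -1.0, 1.0, -1.0] for _ in range(4)]
c_v1 = vertical_correlation(W, 1)
c_h1 = horizontal_correlation(W, 1)
```

In practice these quantities would additionally be averaged over realizations of the training sequence, as described in the text.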

The issue of learning the rule from a teacher is not treated in this framework; however,
we add the following (numerical) observation: the overlap between two (initially random)
networks, $R = W^1 \cdot W^2$, learning from the same rule (training sequence) decreases with
the size of the network ($N$) and remains very low even for $L \gg N^2$, although the two
networks asymptotically generate sequences with similar correlation functions. The same
holds when each network learns from a different sequence. We conclude that the network
indeed learns the statistical properties of the sequence, but not its values. Clearly, in batch
learning one expects that the network would learn the rule when $L = N^2$, since the number
of free parameters (weights) equals the number of examples.

III. GENERATING COLORED SEQUENCES BY A COLORED NETWORK 

So far, we have demonstrated the capability of the network to capture statistical properties
of the training sequence. Let us now consider the inverse problem, i.e., that of constructing
a network that is capable of generating correlated sequences. The information obtained
from the trained networks in the previous section regarding the structure of the weight matrix
suggests that the significant correlation is present between the elements of a column,
i.e., in the vertical direction. Therefore, we would like to compare the correlation functions of sequences
generated by networks with the same vertical correlation function (power-law) and
various horizontal decay forms, i.e., a power law with an increasing exponent $\gamma$. The weights
are constructed as follows. Start by generating a random matrix of normally distributed
elements. Each column is treated as a 1D sequence and is "colored" following the process
described above for generating a 1D correlated sequence. After this stage, the rows are still
uncorrelated. To achieve a different power-law function for the rows, we treat each row independently
as a 1D sequence and follow the same procedure as above, this time with (possibly)
a different exponent. This process generates a weight matrix with pronounced correlations
in the vertical and horizontal directions only. We normalize the weights, $\sum_{j=1}^{N} W_{ij}^2 = N$,
such that $\beta = O(1)$ (independent of $N$). The value of $\beta$ in the dynamic equations, eqs. (2) and (3),
is taken well above the bifurcation to increase the probability of non-periodic attractors [11]
(we carefully avoid the periodic attractors in our measurements). In the analysis described
below, each sample network (colored $W$) is initialized at random ($S^0$) and the correlation
function of the generated sequence is calculated at long times.
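The construction of a colored weight matrix can be sketched as follows. This is a minimal sketch, not the authors' code: the coloring step is plain Fourier filtering with a naive transform (an assumption, valid for exponents between 0 and 1), the names `color` and `colored_matrix` are hypothetical, and the small size `N = 32` is chosen only to keep the example fast.

```python
import cmath
import math
import random

def color(seq, gamma):
    """Reshape the spectrum of seq by f**(-(1 - gamma)/2) using a naive DFT."""
    n = len(seq)
    out = [0.0] * n
    for k in range(1, n):                   # k = 0 is dropped, removing the mean
        coeff = sum(seq[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        coeff *= (min(k, n - k) / n) ** (-(1.0 - gamma) / 2.0)
        for t in range(n):
            out[t] += (coeff * cmath.exp(2j * math.pi * k * t / n)).real / n
    return out

def colored_matrix(N, gamma_v, gamma_h=None, rng=None):
    """Gaussian matrix colored along columns and, optionally, along rows."""
    rng = rng or random.Random()
    W = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(N)]
    for j in range(N):                      # vertical direction: color each column
        col = color([W[i][j] for i in range(N)], gamma_v)
        for i in range(N):
            W[i][j] = col[i]
    if gamma_h is not None:                 # horizontal direction: color each row
        W = [color(row, gamma_h) for row in W]
    for i in range(N):                      # normalize: sum_j W[i][j]**2 == N
        norm = math.sqrt(sum(w * w for w in W[i]) / N)
        W[i] = [w / norm for w in W[i]]
    return W

W = colored_matrix(32, gamma_v=0.4, rng=random.Random(2))
```

Leaving `gamma_h` unset corresponds to the case with no horizontal correlations; passing a value colors the rows after the columns, as in the two-step procedure described above.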

Figure 3 depicts two cases for which the vertical correlation function of the weights decays
as a power law with exponents $\gamma_v = 0.4$ and $\gamma_v = 0.6$. For each case, the horizontal correlation
function takes one of three forms: $\gamma_h = \gamma_v$, a second value of $\gamma_h$ (given in the figure), or uncorrelated. The
results were obtained for a network of size $N = 2048$ and averaged over 50 realizations of
the weight matrix. Additional averaging is done by starting from several initial conditions
for each matrix. It is apparent that the symmetric case, $\gamma_h = \gamma_v$, gives rise to relatively
poor long-range correlations in the generated sequence. The other two cases exhibit much
longer-range correlations. Although we are not trying to determine the optimal correlation
function, it seems that weak correlations are better than a lack of horizontal correlations. This
is in agreement with our findings regarding the trained networks, see Fig. 2.

To conclude, we propose a naive calculation which should serve as a starting point for
the analytical investigation of the model. The quantity of interest in our calculation is the
asymptotic correlation function,

$$C(l) = Z \langle S_i S_{i+l} \rangle_{t,W} , \tag{12}$$

where $Z$ is a normalization factor ($C(0) = 1$). The average is taken over the time $t$ and the
realizations of the weights, expecting $C(l)$ to be independent of the site $i$. For simplicity we
use the parallel updating rule, eq. (3); hence, the stationary state of the correlation function
may be given by

$$C(l) = Z \Big\langle \tanh\Big( \beta \sum_{j=1}^{N} S_j W_{i,j} \Big) \tanh\Big( \beta \sum_{k=1}^{N} S_k W_{i+l,k} \Big) \Big\rangle_{t,W} . \tag{13}$$

The approximation of eq. (13) consists of linearizing the r.h.s. and assuming $S$ independent
of the realization of $W$, leading to

$$C(l) \approx \sum_{j,k=1}^{N} \langle S_j S_k \rangle_t \, \langle W_{i,j} W_{i+l,k} \rangle_W . \tag{14}$$

Next, we identify the averages as the correlation functions defined above, which depend on
the distance only, and rewrite eq. (14) using $m = k - j$ in the form

$$C(l) = Z \sum_{m=1}^{N} C(m)\, C_W(l, m) , \tag{15}$$

where $Z$ is a normalization factor, $C_W(l, m)$ denotes the 2D correlation function of the
weights, and $C(l)$ is the 1D function which is the quantity of interest. In the scenario described
above, $C_W$ is known a priori (independent of time) since the weights are constructed.
If we assume power-law vertical correlations for $C_W$ and no horizontal correlations, the
correlation function of the sequence, $C(l)$, simply follows the vertical correlations of the
weights, i.e., $C(l) = Z \sum_m C_W(l, m)\, \delta_{m,0} = Z\, C_W(l, 0)$, supporting the above findings. We
remark that this decomposition of $C_W$ into independent vertical and horizontal functions
is still a good approximation when $C_h$ is not a delta function, as long as $C_h$ decays much
faster than $C_v$, which enables us to neglect other correlations. In this case eq. (15) can be
formulated as an "eigenvalue problem" of the matrix $C_W$. A few such cases have been solved
numerically, for which the assumption of decomposition was found consistent, see [12]. When
this decomposition is no longer valid, we observe a breakdown of the long-range power-law
behavior, see Fig. 3, the case $\gamma_h = \gamma_v$.
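For reference, the intermediate steps of the naive calculation can be written out explicitly. The lines below are a sketch of the reasoning under the stated assumptions (small-argument linearization of the hyperbolic tangent, factorization of the averages over $S$ and $W$, and translation invariance); the normalization factor $Z$ is redefined from line to line to absorb constants such as $\beta^2$ and $N$.

```latex
% Linearize \tanh(x) \approx x in eq. (13) and factorize the two averages:
C(l) \approx Z \beta^2 \sum_{j,k=1}^{N}
      \langle S_j S_k \rangle_t \, \langle W_{i,j} W_{i+l,k} \rangle_W .
% Assume both averages depend on distances only, with m = k - j:
\langle S_j S_k \rangle_t \to C(m), \qquad
\langle W_{i,j} W_{i+l,k} \rangle_W \to C_W(l, m) .
% The free sum over j contributes a factor N, absorbed with \beta^2 into Z:
C(l) = Z \sum_{m} C(m)\, C_W(l, m) .
% No horizontal correlations, C_W(l, m) = C_v(l)\,\delta_{m,0}, and C(0) = 1:
C(l) = Z\, C(0)\, C_v(l) \propto C_v(l) .
```

The last line makes explicit why, in this approximation, the sequence inherits the exponent of the vertical weight correlations.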



IV. DISCUSSION 

In this paper we analyzed the capability of a neural network model to learn the rule of 
a long-range correlated sequence on the one hand, and a method for constructing a network 
that is able to generate such sequences on the other hand. We demonstrated that a simple 
on-line learning algorithm can be used to extract the rule, provided that the sequence is long 
enough. The fluctuations observed in the training patterns are manifested in the generated 
sequences as well. The investigation of the weights that were obtained during the learning 
process indicates that the vertical correlations ($C_v$, eq. (11)) play the most important role
in generating correlated sequences by the network. Employing this finding, we were able
to construct networks (without training) that are capable of generating sequences with 
a predefined power-law correlation function. Indeed, we found that additional significant 
horizontal correlations in the constructed networks' weights corrupt this property. These 
observations were confirmed by a naive analytical treatment of the stationary correlation 
function. 

The question of the optimal learning rate ($\eta$) was not treated, although it seems to
be an important parameter in the convergence of the training process. Another question 
which deserves further research regards the analytical derivation of the correlation function. 
Taking into account the nonlinearity of the transfer function is necessary to close the naive 
calculation self-consistently, and to obtain the corrections to the correlation function. 
FIGURES

FIG. 1. The correlation function $C(l)$ (eq. (10)) of the sequences generated by the trained networks
with $N = 200$. The training patterns are generated from correlated 1D sequences with
$\gamma = 0.4, 0.6, 0.8$. $C(l)$ is shown along with the power-law regression lines; the respective exponents
are 0.42, 0.63, 0.76. The opaque triangles correspond to training by sliding $N$ sites each cycle,
and the exponent of the regression line is 0.7 (for $\gamma = 0.8$).

FIG. 2. The correlation functions of the weights $W$ of trained networks with $N = 300$. $C_{h(v)}$
is averaged over the rows (columns) of the weight matrix (eq. (11)). The dashed lines correspond
to regression fits: a power law $C_v(l) \sim l^{-0.625 \pm 0.016}$ and an exponential $C_h(l) \sim \exp(-a\, l)$ with
$a = 0.03 \pm 0.001$. The opaque circles represent $C_v$ in the case of training by sliding $N$ sites each
cycle for a network with $N = 200$. The power-law regression fit is $C_v(l) \sim l^{-0.61 \pm 0.025}$.

FIG. 3. The correlation function of sequences generated by a constructed colored network,
$N = 2048$. (a) $\gamma_v = 0.4$, (b) $\gamma_v = 0.6$. $\gamma_h$ is given in the figure ("no cor" stands for no horizontal
correlations). The solid lines are the power-law regression fits with exponents: (a) 0.47 (top line),
0.46 (middle) and 0.5 (bottom); (b) 0.64 (top), 0.64 (middle) and 0.63 (bottom).